pacman::p_load(corrplot, ggpubr, sf, spdep, GWmodel, tmap, tidyverse,
gtsummary, kableExtra, skimr, funModeling, blorr, caret)In-class_EX05
Explaining Water Point Status in Osun State
In this in-class exercise, we develop an explanatory model for waterpoint status in the Osun state of Nigeria to understand what factors contribute to waterpoints being non-functional. Waterpoint status is a binary variable (functional or not functional), as such, we will use logistic regression for this exercise.
The explanatory variables to be used are listed below. Most of the variables are continuous and the last 3 variables are not categorical.
distance to primary road,
distance to secondary road,
distance to tertiary toad,
distance to city,
distance to town,
waterpoint population,
location population 1km,
usage capacity,
is urban,
watersource clean
1. Setting Up
Loading Packages
We will use the following packages:
sf and spdep: spatial data handling and spatial weights
tidyverse: manipulation of attribute data and plotting visualisations (aspatial)
skimr and funModeling: exploratory data analysis
caret: generating confusion matrices
tmap: creating map visualisations
corrplot and ggpubr: multivariate data analysis and visualisation
blorr: building logistic regression and performing diagnostics tests
GWmodel: calibrating geographical weighted family of models
gtsummary: create publication-ready analytical and summary tables
Loading Dataset
Now we load the geospatial data of the waterpoints in Osun state.
wp <- readRDS("data/osun_wp_sf.rds")glimpse(wp)Rows: 4,760
Columns: 75
$ row_id <dbl> 70578, 66401, 65607, 68989, 67708, 66419, 6…
$ source <chr> "Federal Ministry of Water Resources, Niger…
$ lat_deg <dbl> 7.759448, 8.031187, 7.891137, 7.509588, 7.4…
$ lon_deg <dbl> 4.563998, 4.637400, 4.713438, 4.265002, 4.3…
$ report_date <chr> "05/11/2015 12:00:00 AM", "04/30/2015 12:00…
$ status_id <chr> "No", "No", "No", "No", "Yes", "Yes", "Yes"…
$ water_source_clean <chr> "Borehole", "Borehole", "Borehole", "Boreho…
$ water_source_category <chr> "Well", "Well", "Well", "Well", "Well", "We…
$ water_tech_clean <chr> "Mechanized Pump", "Mechanized Pump", "Mech…
$ water_tech_category <chr> "Mechanized Pump", "Mechanized Pump", "Mech…
$ facility_type <chr> "Improved", "Improved", "Improved", "Improv…
$ clean_country_name <chr> "Nigeria", "Nigeria", "Nigeria", "Nigeria",…
$ clean_adm1 <chr> "Osun", "Osun", "Osun", "Osun", "Osun", "Os…
$ clean_adm2 <chr> "Osogbo", "Odo-Otin", "Boripe", "Ayedire", …
$ clean_adm3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ clean_adm4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ install_year <dbl> NA, 2004, 2006, 2014, 2011, 2007, 2007, 200…
$ installer <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ rehab_year <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ rehabilitator <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ management_clean <chr> NA, NA, NA, NA, NA, "Community Management",…
$ status_clean <chr> "Abandoned/Decommissioned", "Abandoned/Deco…
$ pay <chr> "No", "No", "No", "No", "No", "No", "No", "…
$ fecal_coliform_presence <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fecal_coliform_value <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ subjective_quality <chr> "Acceptable quality", "Acceptable quality",…
$ activity_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ scheme_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ wpdx_id <chr> "6FV6QH57+QHH", "6FW62JJP+FXC", "6FV6VPR7+F…
$ notes <chr> "COSTAIN area, Osogbo", "Igbaye", "Araromi…
$ orig_lnk <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ photo_lnk <chr> "https://akvoflow-55.s3.amazonaws.com/image…
$ country_id <chr> "NG", "NG", "NG", "NG", "NG", "NG", "NG", "…
$ data_lnk <chr> "https://catalog.waterpointdata.org/dataset…
$ distance_to_primary_road <dbl> 167.82235, 4133.71306, 6559.76182, 8260.715…
$ distance_to_secondary_road <dbl> 838.9185, 1162.6246, 4625.0324, 5854.5731, …
$ distance_to_tertiary_road <dbl> 1181.107236, 9.012647, 58.314987, 2013.6515…
$ distance_to_city <dbl> 2449.2931, 16704.1923, 21516.8495, 31765.68…
$ distance_to_town <dbl> 9463.295, 5176.899, 8589.103, 16386.459, 23…
$ water_point_history <chr> "{\"2015-05-11\": {\"photo_lnk\": \"https:/…
$ rehab_priority <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ water_point_population <dbl> NA, NA, NA, NA, 0, 508, 162, 362, 3562, 217…
$ local_population_1km <dbl> NA, NA, NA, NA, 70, 647, 449, 1611, 7199, 2…
$ crucialness_score <dbl> NA, NA, NA, NA, NA, 0.785162287, 0.36080178…
$ pressure_score <dbl> NA, NA, NA, NA, NA, 1.69333333, 0.54000000,…
$ usage_capacity <dbl> 1000, 1000, 1000, 300, 1000, 300, 300, 300,…
$ is_urban <lgl> TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALS…
$ days_since_report <dbl> 2689, 2700, 2688, 2693, 2701, 2692, 2681, 2…
$ staleness_score <dbl> 42.84384, 42.69554, 42.85735, 42.78986, 42.…
$ latest_record <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
$ location_id <dbl> 239556, 230405, 240009, 236400, 229231, 237…
$ cluster_size <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2…
$ clean_country_id <chr> "NGA", "NGA", "NGA", "NGA", "NGA", "NGA", "…
$ country_name <chr> "Nigeria", "Nigeria", "Nigeria", "Nigeria",…
$ water_source <chr> "Improved Tube well or borehole", "Improved…
$ water_tech <chr> "Motorised", "Motorised", "Motorised", "yet…
$ status <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRU…
$ adm2 <chr> "Osogbo", "Odo-Otin", "Boripe", "Aiyedire",…
$ adm3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ management <chr> NA, NA, NA, NA, NA, "Community Management",…
$ adm1 <chr> "Osun", "Osun", "Osun", "Osun", "Osun", "Os…
$ `New Georeferenced Column` <chr> "POINT (4.5639983 7.7594483)", "POINT (4.63…
$ lat_deg_original <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ lat_lon_deg <chr> "(7.7594483°, 4.5639983°)", "(8.0311867°, 4…
$ lon_deg_original <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ public_data_source <chr> "https://catalog.waterpointdata.org/datafil…
$ converted <chr> "#status_id, #water_source, #pay, #status, …
$ count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ created_timestamp <chr> "06/30/2020 12:56:07 PM", "06/30/2020 12:56…
$ updated_timestamp <chr> "06/30/2020 12:56:07 PM", "06/30/2020 12:56…
$ Geometry <POINT [m]> POINT (236239.7 417577), POINT (24463…
$ ADM2_EN <chr> "Osogbo", "Odo-Otin", "Boripe", "Aiyedire",…
$ ADM2_PCODE <chr> "NG030030", "NG030025", "NG030006", "NG0300…
$ ADM1_EN <chr> "Osun", "Osun", "Osun", "Osun", "Osun", "Os…
$ ADM1_PCODE <chr> "NG030", "NG030", "NG030", "NG030", "NG030"…
Next, we need to load the dataset containing the boundaries of the administrative areas of Osun state.
osun <- readRDS("data/Osun.rds")glimpse(osun)Rows: 30
Columns: 5
$ ADM2_EN <chr> "Aiyedade", "Aiyedire", "Atakumosa East", "Atakumosa West",…
$ ADM2_PCODE <chr> "NG030001", "NG030002", "NG030003", "NG030004", "NG030005",…
$ ADM1_EN <chr> "Osun", "Osun", "Osun", "Osun", "Osun", "Osun", "Osun", "Os…
$ ADM1_PCODE <chr> "NG030", "NG030", "NG030", "NG030", "NG030", "NG030", "NG03…
$ geometry <MULTIPOLYGON [m]> MULTIPOLYGON (((213526.6 34..., MULTIPOLYGON (…
Let’s plot the data out on a map.
tm_shape(osun) +
tm_polygons()+
tm_shape(wp)+
tm_dots()
2. Exploratory Data Analysis
We should check the distribution of the dependent variable of waterpoint status. We need to make sure that it is indeed binary and check how skewed the distribution is. If one outcome is much more common than others, the model may not be as effective at predicting the less common outcome.
freq(wp, input='status')Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the funModeling package.
Please report the issue at <https://github.com/pablo14/funModeling/issues>.

status frequency percentage cumulative_perc
1 TRUE 2642 55.5 55.5
2 FALSE 2118 44.5 100.0
tmap_mode("view")tmap mode set to interactive viewing
tm_shape(osun) +
tm_polygons(alpha=0.4) +
tm_shape(wp) +
tm_dots(col="status",
alpha=0.6) +
tm_view(set.zoom.limits = c(9, 12))the skim() function of the skimr package to do quick exploratory data analysis. For categorical variables, it shows the number of the missing values and unique variable view. For binary variables, shows the number of missing values and gives a frequency count. For numercial fields, on top of missing values, it also shows some summary statistics like mean, standard deviation.
skim(wp)Warning: Couldn't find skimmers for class: sfc_POINT, sfc; No user-defined `sfl`
provided. Falling back to `character`.
| Name | wp |
| Number of rows | 4760 |
| Number of columns | 75 |
| _______________________ | |
| Column type frequency: | |
| character | 47 |
| logical | 5 |
| numeric | 23 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| source | 0 | 1.00 | 5 | 44 | 0 | 2 | 0 |
| report_date | 0 | 1.00 | 22 | 22 | 0 | 42 | 0 |
| status_id | 0 | 1.00 | 2 | 7 | 0 | 3 | 0 |
| water_source_clean | 0 | 1.00 | 8 | 22 | 0 | 3 | 0 |
| water_source_category | 0 | 1.00 | 4 | 6 | 0 | 2 | 0 |
| water_tech_clean | 24 | 0.99 | 9 | 23 | 0 | 3 | 0 |
| water_tech_category | 24 | 0.99 | 9 | 15 | 0 | 2 | 0 |
| facility_type | 0 | 1.00 | 8 | 8 | 0 | 1 | 0 |
| clean_country_name | 0 | 1.00 | 7 | 7 | 0 | 1 | 0 |
| clean_adm1 | 0 | 1.00 | 3 | 5 | 0 | 5 | 0 |
| clean_adm2 | 0 | 1.00 | 3 | 14 | 0 | 35 | 0 |
| clean_adm3 | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| clean_adm4 | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| installer | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| management_clean | 1573 | 0.67 | 5 | 37 | 0 | 7 | 0 |
| status_clean | 0 | 1.00 | 9 | 32 | 0 | 7 | 0 |
| pay | 0 | 1.00 | 2 | 39 | 0 | 7 | 0 |
| fecal_coliform_presence | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| subjective_quality | 0 | 1.00 | 18 | 20 | 0 | 4 | 0 |
| activity_id | 4757 | 0.00 | 36 | 36 | 0 | 3 | 0 |
| scheme_id | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| wpdx_id | 0 | 1.00 | 12 | 12 | 0 | 4760 | 0 |
| notes | 0 | 1.00 | 2 | 96 | 0 | 3502 | 0 |
| orig_lnk | 4757 | 0.00 | 84 | 84 | 0 | 1 | 0 |
| photo_lnk | 41 | 0.99 | 84 | 84 | 0 | 4719 | 0 |
| country_id | 0 | 1.00 | 2 | 2 | 0 | 1 | 0 |
| data_lnk | 0 | 1.00 | 79 | 96 | 0 | 2 | 0 |
| water_point_history | 0 | 1.00 | 142 | 834 | 0 | 4750 | 0 |
| clean_country_id | 0 | 1.00 | 3 | 3 | 0 | 1 | 0 |
| country_name | 0 | 1.00 | 7 | 7 | 0 | 1 | 0 |
| water_source | 0 | 1.00 | 8 | 30 | 0 | 4 | 0 |
| water_tech | 0 | 1.00 | 5 | 37 | 0 | 20 | 0 |
| adm2 | 0 | 1.00 | 3 | 14 | 0 | 33 | 0 |
| adm3 | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| management | 1573 | 0.67 | 5 | 47 | 0 | 7 | 0 |
| adm1 | 0 | 1.00 | 4 | 5 | 0 | 4 | 0 |
| New Georeferenced Column | 0 | 1.00 | 16 | 35 | 0 | 4760 | 0 |
| lat_lon_deg | 0 | 1.00 | 13 | 32 | 0 | 4760 | 0 |
| public_data_source | 0 | 1.00 | 84 | 102 | 0 | 2 | 0 |
| converted | 0 | 1.00 | 53 | 53 | 0 | 1 | 0 |
| created_timestamp | 0 | 1.00 | 22 | 22 | 0 | 2 | 0 |
| updated_timestamp | 0 | 1.00 | 22 | 22 | 0 | 2 | 0 |
| Geometry | 0 | 1.00 | 33 | 37 | 0 | 4760 | 0 |
| ADM2_EN | 0 | 1.00 | 3 | 14 | 0 | 30 | 0 |
| ADM2_PCODE | 0 | 1.00 | 8 | 8 | 0 | 30 | 0 |
| ADM1_EN | 0 | 1.00 | 4 | 4 | 0 | 1 | 0 |
| ADM1_PCODE | 0 | 1.00 | 5 | 5 | 0 | 1 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| rehab_year | 4760 | 0 | NaN | : |
| rehabilitator | 4760 | 0 | NaN | : |
| is_urban | 0 | 1 | 0.39 | FAL: 2884, TRU: 1876 |
| latest_record | 0 | 1 | 1.00 | TRU: 4760 |
| status | 0 | 1 | 0.56 | TRU: 2642, FAL: 2118 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| row_id | 0 | 1.00 | 68550.48 | 10216.94 | 49601.00 | 66874.75 | 68244.50 | 69562.25 | 471319.00 | ▇▁▁▁▁ |
| lat_deg | 0 | 1.00 | 7.68 | 0.22 | 7.06 | 7.51 | 7.71 | 7.88 | 8.06 | ▁▂▇▇▇ |
| lon_deg | 0 | 1.00 | 4.54 | 0.21 | 4.08 | 4.36 | 4.56 | 4.71 | 5.06 | ▃▆▇▇▂ |
| install_year | 1144 | 0.76 | 2008.63 | 6.04 | 1917.00 | 2006.00 | 2010.00 | 2013.00 | 2015.00 | ▁▁▁▁▇ |
| fecal_coliform_value | 4760 | 0.00 | NaN | NA | NA | NA | NA | NA | NA | |
| distance_to_primary_road | 0 | 1.00 | 5021.53 | 5648.34 | 0.01 | 719.36 | 2972.78 | 7314.73 | 26909.86 | ▇▂▁▁▁ |
| distance_to_secondary_road | 0 | 1.00 | 3750.47 | 3938.63 | 0.15 | 460.90 | 2554.25 | 5791.94 | 19559.48 | ▇▃▁▁▁ |
| distance_to_tertiary_road | 0 | 1.00 | 1259.28 | 1680.04 | 0.02 | 121.25 | 521.77 | 1834.42 | 10966.27 | ▇▂▁▁▁ |
| distance_to_city | 0 | 1.00 | 16663.99 | 10960.82 | 53.05 | 7930.75 | 15030.41 | 24255.75 | 47934.34 | ▇▇▆▃▁ |
| distance_to_town | 0 | 1.00 | 16726.59 | 12452.65 | 30.00 | 6876.92 | 12204.53 | 27739.46 | 44020.64 | ▇▅▃▃▂ |
| rehab_priority | 2654 | 0.44 | 489.33 | 1658.81 | 0.00 | 7.00 | 91.50 | 376.25 | 29697.00 | ▇▁▁▁▁ |
| water_point_population | 4 | 1.00 | 513.58 | 1458.92 | 0.00 | 14.00 | 119.00 | 433.25 | 29697.00 | ▇▁▁▁▁ |
| local_population_1km | 4 | 1.00 | 2727.16 | 4189.46 | 0.00 | 176.00 | 1032.00 | 3717.00 | 36118.00 | ▇▁▁▁▁ |
| crucialness_score | 798 | 0.83 | 0.26 | 0.28 | 0.00 | 0.07 | 0.15 | 0.35 | 1.00 | ▇▃▁▁▁ |
| pressure_score | 798 | 0.83 | 1.46 | 4.16 | 0.00 | 0.12 | 0.41 | 1.24 | 93.69 | ▇▁▁▁▁ |
| usage_capacity | 0 | 1.00 | 560.74 | 338.46 | 300.00 | 300.00 | 300.00 | 1000.00 | 1000.00 | ▇▁▁▁▅ |
| days_since_report | 0 | 1.00 | 2692.69 | 41.92 | 1483.00 | 2688.00 | 2693.00 | 2700.00 | 4645.00 | ▁▇▁▁▁ |
| staleness_score | 0 | 1.00 | 42.80 | 0.58 | 23.13 | 42.70 | 42.79 | 42.86 | 62.66 | ▁▁▇▁▁ |
| location_id | 0 | 1.00 | 235865.49 | 6657.60 | 23741.00 | 230638.75 | 236199.50 | 240061.25 | 267454.00 | ▁▁▁▁▇ |
| cluster_size | 0 | 1.00 | 1.05 | 0.25 | 1.00 | 1.00 | 1.00 | 1.00 | 4.00 | ▇▁▁▁▁ |
| lat_deg_original | 4760 | 0.00 | NaN | NA | NA | NA | NA | NA | NA | |
| lon_deg_original | 4760 | 0.00 | NaN | NA | NA | NA | NA | NA | NA | |
| count | 0 | 1.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▁▇▁▁ |
There are some missing values for 2 of the variables (water_point_population, local_population_1km). The following code chunk removes any observations with missing values in any of the explanatory variables. We create a list a of all the variables of interest and filter out any observations with missing values in any of the variables in this list.
expvars <- c("status","distance_to_primary_road", "distance_to_secondary_road",
"distance_to_tertiary_road", "distance_to_city",
"distance_to_town", "water_point_population",
"local_population_1km", "usage_capacity", "is_urban",
"water_source_clean")
wp_clean <- wp %>%
filter(!if_any(expvars, ~is.na(.x)))%>%
mutate(usage_capacity = as.factor(usage_capacity))Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
# Was:
data %>% select(expvars)
# Now:
data %>% select(all_of(expvars))
See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
Next, we need to check for multicollinearity. As we cannot o that on the data with geometry information, we need to strip the geometry information with st_set_geometry(NULL). We can then plot the correlation matrix.
wp_vars <- wp_clean %>%
select(expvars)%>%
st_set_geometry(NULL)vars.cor = cor(wp_vars[2:7])
corrplot.mixed(vars.cor,
lower="ellipse",
upper="number",
tl.pos = "lt",
diag="l",
tl.col = "black")
There are no variables that display multi-collinearity so we do not need to drop any. Now we can proceed to do a logistic regression. The first line of the code below creates the regression formula from the list of variables of interest using the paste() function. This method of creating a formula from lists is convenient when we have many explanatory variables so we don’t need to keep typing it out. We can then input the formula into the glm() function to perform the regression.
fm <- as.formula(paste("status ~",
paste(expvars[2:11], collapse="+")))
model <- glm(fm,
data=wp_clean,
family=binomial(link="logit"))The blr_regress() function of the blorr package creates a neat logistic regression report.
blr_regress(model) Model Overview
------------------------------------------------------------------------
Data Set Resp Var Obs. Df. Model Df. Residual Convergence
------------------------------------------------------------------------
data status 4756 4755 4744 TRUE
------------------------------------------------------------------------
Response Summary
--------------------------------------------------------
Outcome Frequency Outcome Frequency
--------------------------------------------------------
0 2114 1 2642
--------------------------------------------------------
Maximum Likelihood Estimates
-----------------------------------------------------------------------------------------------
Parameter DF Estimate Std. Error z value Pr(>|z|)
-----------------------------------------------------------------------------------------------
(Intercept) 1 0.3887 0.1124 3.4588 5e-04
distance_to_primary_road 1 0.0000 0.0000 -0.7153 0.4744
distance_to_secondary_road 1 0.0000 0.0000 -0.5530 0.5802
distance_to_tertiary_road 1 1e-04 0.0000 4.6708 0.0000
distance_to_city 1 0.0000 0.0000 -4.7574 0.0000
distance_to_town 1 0.0000 0.0000 -4.9170 0.0000
water_point_population 1 -5e-04 0.0000 -11.3686 0.0000
local_population_1km 1 3e-04 0.0000 19.2953 0.0000
usage_capacity1000 1 -0.6230 0.0697 -8.9366 0.0000
is_urbanTRUE 1 -0.2971 0.0819 -3.6294 3e-04
water_source_cleanProtected Shallow Well 1 0.5040 0.0857 5.8783 0.0000
water_source_cleanProtected Spring 1 1.2882 0.4388 2.9359 0.0033
-----------------------------------------------------------------------------------------------
Association of Predicted Probabilities and Observed Responses
---------------------------------------------------------------
% Concordant 0.7347 Somers' D 0.4693
% Discordant 0.2653 Gamma 0.4693
% Tied 0.0000 Tau-a 0.2318
Pairs 5585188 c 0.7347
---------------------------------------------------------------
There are 2 variables which are not statistically significant (p-value>0.05). They are not good predictors and should be considered. distance_to_tertiary_road, distance_to_city, distance_to_town and local_population_1km have positive coefficients, indicating that larger values correspond with higher possibility of a waterpoint being functional.
We should check the confusion matrix of the model to check the prediction accuracy. The blr_confusion_matrix() function can conveniently generate this information for us. We can also change the cutoff threshold (probability at which to classify the result as TRUE or FALSE).
blr_confusion_matrix(model, cutoff=0.5)Confusion Matrix and Statistics
Reference
Prediction FALSE TRUE
0 1301 738
1 813 1904
Accuracy : 0.6739
No Information Rate : 0.4445
Kappa : 0.3373
McNemars's Test P-Value : 0.0602
Sensitivity : 0.7207
Specificity : 0.6154
Pos Pred Value : 0.7008
Neg Pred Value : 0.6381
Prevalence : 0.5555
Detection Rate : 0.4003
Detection Prevalence : 0.5713
Balanced Accuracy : 0.6680
Precision : 0.7008
Recall : 0.7207
'Positive' Class : 1
The overall accuracy of the model is 67%. The model is better at predicting positives than negatives as the true positive rate (sensitivity) is higher than the true negative rate (specificity).
3. Geographically Weighted Logistic Regression
Next, we want to conduct a geographically weighted logistic regression.
wp_clean_sp <- wp_clean %>%
select(expvars) %>%
as_Spatial()
wp_clean_spclass : SpatialPointsDataFrame
features : 4756
extent : 182502.4, 290751, 340054.1, 450905.3 (xmin, xmax, ymin, ymax)
crs : +proj=tmerc +lat_0=4 +lon_0=8.5 +k=0.99975 +x_0=670553.98 +y_0=0 +a=6378249.145 +rf=293.465 +towgs84=-92,-93,122,0,0,0,0 +units=m +no_defs
variables : 11
names : status, distance_to_primary_road, distance_to_secondary_road, distance_to_tertiary_road, distance_to_city, distance_to_town, water_point_population, local_population_1km, usage_capacity, is_urban, water_source_clean
min values : 0, 0.014461356813335, 0.152195902540837, 0.017815121653488, 53.0461399623541, 30.0019777713073, 0, 0, 1000, 0, Borehole
max values : 1, 26909.8616132094, 19559.4793799085, 10966.2705628969, 47934.343603562, 44020.6393368124, 29697, 36118, 300, 1, Protected Spring
The next step is to create the spatial weights matrix. We need to use a distance-based spatial weights matrix to conduct the logistic regression. The following code chunk uses AIC to recommend the maximum distance to consider neighbours for a fixed distance matrix.
bw.fixed <- bw.ggwr(fm,
data=wp_clean_sp,
family = "binomial",
approach= "AIC",
kernel = "gaussian",
adaptive = FALSE,
longlat= FALSE)bw.fixed[1] 2599.672
The recommended maximum bandwidth for fixed distance matrix is 2599.672m.
gwlr.fixed <- ggwr.basic(fm,
data=wp_clean_sp,
bw = 2599.672,
family = "binomial",
kernel = "gaussian",
adaptive = FALSE,
longlat= FALSE) Iteration Log-Likelihood
=========================
0 -1958
1 -1676
2 -1526
3 -1443
4 -1405
5 -1405
gwlr.fixed ***********************************************************************
* Package GWmodel *
***********************************************************************
Program starts at: 2022-12-17 19:49:03
Call:
ggwr.basic(formula = fm, data = wp_clean_sp, bw = 2599.672, family = "binomial",
kernel = "gaussian", adaptive = FALSE, longlat = FALSE)
Dependent (y) variable: status
Independent variables: distance_to_primary_road distance_to_secondary_road distance_to_tertiary_road distance_to_city distance_to_town water_point_population local_population_1km usage_capacity is_urban water_source_clean
Number of data points: 4756
Used family: binomial
***********************************************************************
* Results of Generalized linear Regression *
***********************************************************************
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-124.555 -1.755 1.072 1.742 34.333
Coefficients:
Estimate Std. Error z value Pr(>|z|)
Intercept 3.887e-01 1.124e-01 3.459 0.000543
distance_to_primary_road -4.642e-06 6.490e-06 -0.715 0.474422
distance_to_secondary_road -5.143e-06 9.299e-06 -0.553 0.580230
distance_to_tertiary_road 9.683e-05 2.073e-05 4.671 3.00e-06
distance_to_city -1.686e-05 3.544e-06 -4.757 1.96e-06
distance_to_town -1.480e-05 3.009e-06 -4.917 8.79e-07
water_point_population -5.097e-04 4.484e-05 -11.369 < 2e-16
local_population_1km 3.451e-04 1.788e-05 19.295 < 2e-16
usage_capacity1000 -6.230e-01 6.972e-02 -8.937 < 2e-16
is_urbanTRUE -2.971e-01 8.185e-02 -3.629 0.000284
water_source_cleanProtected Shallow Well 5.040e-01 8.574e-02 5.878 4.14e-09
water_source_cleanProtected Spring 1.288e+00 4.388e-01 2.936 0.003325
Intercept ***
distance_to_primary_road
distance_to_secondary_road
distance_to_tertiary_road ***
distance_to_city ***
distance_to_town ***
water_point_population ***
local_population_1km ***
usage_capacity1000 ***
is_urbanTRUE ***
water_source_cleanProtected Shallow Well ***
water_source_cleanProtected Spring **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6534.5 on 4755 degrees of freedom
Residual deviance: 5688.0 on 4744 degrees of freedom
AIC: 5712
Number of Fisher Scoring iterations: 5
AICc: 5712.099
Pseudo R-square value: 0.1295351
***********************************************************************
* Results of Geographically Weighted Regression *
***********************************************************************
*********************Model calibration information*********************
Kernel function: gaussian
Fixed bandwidth: 2599.672
Regression points: the same locations as observations are used.
Distance metric: A distance matrix is specified for this model calibration.
************Summary of Generalized GWR coefficient estimates:**********
Min. 1st Qu. Median
Intercept -8.7229e+02 -4.9955e+00 1.7600e+00
distance_to_primary_road -1.9389e-02 -4.8031e-04 2.9618e-05
distance_to_secondary_road -1.5921e-02 -3.7551e-04 1.2317e-04
distance_to_tertiary_road -1.5618e-02 -4.2368e-04 7.6179e-05
distance_to_city -1.8416e-02 -5.6217e-04 -1.2726e-04
distance_to_town -2.2411e-02 -5.7283e-04 -1.5155e-04
water_point_population -5.2208e-02 -2.2767e-03 -9.8875e-04
local_population_1km -1.2698e-01 4.9952e-04 1.0638e-03
usage_capacity1000 -2.0772e+01 -9.7231e-01 -4.1592e-01
is_urbanTRUE -1.9790e+02 -4.2908e+00 -1.6864e+00
water_source_cleanProtected.Shallow.Well -2.0789e+01 -4.5190e-01 5.3340e-01
water_source_cleanProtected.Spring -5.2235e+02 -5.5977e+00 2.5441e+00
3rd Qu. Max.
Intercept 1.2763e+01 1073.2156
distance_to_primary_road 4.8443e-04 0.0142
distance_to_secondary_road 6.0692e-04 0.0258
distance_to_tertiary_road 6.6815e-04 0.0128
distance_to_city 2.3718e-04 0.0150
distance_to_town 1.9271e-04 0.0224
water_point_population 5.0102e-04 0.1309
local_population_1km 1.8157e-03 0.0392
usage_capacity1000 3.0322e-01 5.9281
is_urbanTRUE 1.2841e+00 744.3099
water_source_cleanProtected.Shallow.Well 1.7849e+00 67.6343
water_source_cleanProtected.Spring 6.7663e+00 317.4133
************************Diagnostic information*************************
Number of data points: 4756
GW Deviance: 2795.084
AIC : 4414.606
AICc : 4747.423
Pseudo R-square value: 0.5722559
***********************************************************************
Program stops at: 2022-12-17 19:49:40
We can compare the global logistic regression and the GWLR by comparing the AICc. The AICc of the GWLR (4747.423) is lower than that of the global model (5712.099) , which means that the GWLR is a better explanatory model than the global model.
In order to assess the performance of the GWLR, we need to extract the output into a dataframe.
gwr.fixed <- as.data.frame(gwlr.fixed$SDF)We manually compute the predicted waterpoint status from the probability that the waterpoint is functional (yhat) using the threshold of 0.5 again.
gwr.fixed <- gwr.fixed %>%
mutate(most = ifelse(
yhat >= 0.5, T, F
))The following code chunk creates a confusion matrix by comparing the actual outcome with the predicted likely outcome.
gwr.fixed$y <- as.factor(gwr.fixed$y)
gwr.fixed$most <- as.factor(gwr.fixed$most)
CM <- confusionMatrix(data=gwr.fixed$most,
positive= "TRUE",
reference = gwr.fixed$y)
CMConfusion Matrix and Statistics
Reference
Prediction FALSE TRUE
FALSE 1824 263
TRUE 290 2379
Accuracy : 0.8837
95% CI : (0.8743, 0.8927)
No Information Rate : 0.5555
P-Value [Acc > NIR] : <2e-16
Kappa : 0.7642
Mcnemar's Test P-Value : 0.2689
Sensitivity : 0.9005
Specificity : 0.8628
Pos Pred Value : 0.8913
Neg Pred Value : 0.8740
Prevalence : 0.5555
Detection Rate : 0.5002
Detection Prevalence : 0.5612
Balanced Accuracy : 0.8816
'Positive' Class : TRUE
The accuracy of the local model is much higher than the global model (88% compared to 67%). The true positive (sensitivity) and true negative (specificity) rates have also improved compared to the global model. The local model is much better at explaining occurrence of non-functional waterpoints compared to the global model (specificity increased from 61% to 86%). This indicates that if we want to address the issue of non-functional waterpoints, it would be more effective to consider local factors.
We can plot the predicted outcome spatially by extracting the sdf output from the model as an sf object.
gwr.fixed.sf <- st_as_sf(gwlr.fixed$SDF)estprob <- tm_shape(osun)+
tm_polygons(alpha=0.1)+
tm_text(text="ADM2_EN")+
tm_shape(gwr.fixed.sf) +
tm_dots(col="yhat",
border.col = "gray60",
border.lwd = 1,
palette = "YlOrRd") +
tm_view(set.zoom.limits = c(9, 12))
actual <- tm_shape(osun)+
tm_polygons(alpha=0.1)+
tm_text(text="ADM2_EN")+
tm_shape(gwr.fixed.sf) +
tm_dots(col="y",
border.col = "gray60",
border.lwd = 1,
palette = c("#FFFFB2", "#BD0026")) +
tm_view(set.zoom.limits = c(9, 12))
tmap_arrange(actual, estprob,
asp=1, ncol=2,
sync = TRUE)As yhat is the probability of a waterpoint being functional, the lighter coloured dots have higher likelihood of being non-functional. It appears that there is some clustering of non-functional waterpoints, especially in Osogbo, Ifelodun and Boripe.
Refining the Model
We can further refine the model by removing the variables (distance_to_primary_road, distance_to_secondary_road ) that we previously identified as not statistically significant. To do that, we create a new formula without those variables. We then need to find the recommended fixed bandwidth for the formula.
fm2 <- as.formula(paste("status ~",
paste(expvars[4:11], collapse="+")))
bw.fixed <- bw.ggwr(fm2,
data=wp_clean_sp,
family = "binomial",
approach= "AIC",
kernel = "gaussian",
adaptive = FALSE,
longlat= FALSE)bw.fixed[1] 2377.371
The recommended fixed bandwidth is 2377.371m.
gwlr2.fixed <- ggwr.basic(fm2,
data=wp_clean_sp,
bw = 2377.371,
family = "binomial",
kernel = "gaussian",
adaptive = FALSE,
longlat= FALSE) Iteration Log-Likelihood
=========================
0 -1959
1 -1680
2 -1531
3 -1447
4 -1413
5 -1413
gwlr2.fixed ***********************************************************************
* Package GWmodel *
***********************************************************************
Program starts at: 2022-12-17 19:59:16
Call:
ggwr.basic(formula = fm2, data = wp_clean_sp, bw = 2377.371,
family = "binomial", kernel = "gaussian", adaptive = FALSE,
longlat = FALSE)
Dependent (y) variable: status
Independent variables: distance_to_tertiary_road distance_to_city distance_to_town water_point_population local_population_1km usage_capacity is_urban water_source_clean
Number of data points: 4756
Used family: binomial
***********************************************************************
* Results of Generalized linear Regression *
***********************************************************************
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-129.368 -1.750 1.074 1.742 34.126
Coefficients:
Estimate Std. Error z value Pr(>|z|)
Intercept 3.540e-01 1.055e-01 3.354 0.000796
distance_to_tertiary_road 1.001e-04 2.040e-05 4.910 9.13e-07
distance_to_city -1.764e-05 3.391e-06 -5.202 1.97e-07
distance_to_town -1.544e-05 2.825e-06 -5.466 4.60e-08
water_point_population -5.098e-04 4.476e-05 -11.390 < 2e-16
local_population_1km 3.452e-04 1.779e-05 19.407 < 2e-16
usage_capacity1000 -6.206e-01 6.966e-02 -8.908 < 2e-16
is_urbanTRUE -2.667e-01 7.474e-02 -3.569 0.000358
water_source_cleanProtected Shallow Well 4.947e-01 8.496e-02 5.823 5.79e-09
water_source_cleanProtected Spring 1.279e+00 4.384e-01 2.917 0.003530
Intercept ***
distance_to_tertiary_road ***
distance_to_city ***
distance_to_town ***
water_point_population ***
local_population_1km ***
usage_capacity1000 ***
is_urbanTRUE ***
water_source_cleanProtected Shallow Well ***
water_source_cleanProtected Spring **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6534.5 on 4755 degrees of freedom
Residual deviance: 5688.9 on 4746 degrees of freedom
AIC: 5708.9
Number of Fisher Scoring iterations: 5
AICc: 5708.923
Pseudo R-square value: 0.129406
***********************************************************************
* Results of Geographically Weighted Regression *
***********************************************************************
*********************Model calibration information*********************
Kernel function: gaussian
Fixed bandwidth: 2377.371
Regression points: the same locations as observations are used.
Distance metric: A distance matrix is specified for this model calibration.
************Summary of Generalized GWR coefficient estimates:**********
Min. 1st Qu. Median
Intercept -3.7021e+02 -4.3797e+00 3.5590e+00
distance_to_tertiary_road -3.1622e-02 -4.5462e-04 9.1291e-05
distance_to_city -5.4555e-02 -6.5623e-04 -1.3507e-04
distance_to_town -8.6549e-03 -5.2754e-04 -1.6785e-04
water_point_population -2.9696e-02 -2.2705e-03 -1.2277e-03
local_population_1km -7.7730e-02 4.4281e-04 1.0548e-03
usage_capacity1000 -5.5889e+01 -1.0347e+00 -4.1960e-01
is_urbanTRUE -7.3554e+02 -3.4675e+00 -1.6596e+00
water_source_cleanProtected.Shallow.Well -1.8842e+02 -4.7295e-01 6.2378e-01
water_source_cleanProtected.Spring -1.3630e+03 -5.3436e+00 2.7714e+00
3rd Qu. Max.
Intercept 1.3755e+01 2171.6375
distance_to_tertiary_road 6.3011e-04 0.0237
distance_to_city 1.5921e-04 0.0162
distance_to_town 2.4490e-04 0.0179
water_point_population 4.5879e-04 0.0765
local_population_1km 1.8479e-03 0.0333
usage_capacity1000 3.9113e-01 9.2449
is_urbanTRUE 1.0554e+00 995.1841
water_source_cleanProtected.Shallow.Well 1.9564e+00 66.8914
water_source_cleanProtected.Spring 7.0805e+00 208.3749
************************Diagnostic information*************************
Number of data points: 4756
GW Deviance: 2815.659
AIC : 4418.776
AICc : 4744.213
Pseudo R-square value: 0.5691072
***********************************************************************
Program stops at: 2022-12-17 19:59:48
All the variables at statistically significant at 5% significant level.
The median coefficient of distance_to_tertiary_road and local_population_1km were positive. This means that in general, the further away a waterpoint is from a tertiary road and the larger the population within 1km radius of the waterpoint, the more likely it is for a waterpoint to be functional.
The median coefficients for distance_to_city, distance_to_town and waterpoint population were negative. Meaning that in general, the further the waterpoint was to a city or town, the less likely they were to be functional. Waterpoints with more people depending on it were less likely to be functional. Waterpoints located in urban areas were also less likely to be functional than rural waterpoints.
The AICc of the local model (4744.213) is lower than that of the global model (5708.923). As such, it is a better explanatory model than the local model. The revised local model is slightly better than the first local model (AICc of 4747.423) after removing the explanatory variables that are not statistically significant.
We should also compare the confusion matrix of the new local model with the first local model.
gwr2.fixed <- as.data.frame(gwlr2.fixed$SDF) %>%
mutate(most = as.factor(
ifelse(
yhat >= 0.5, T, F)),
y = as.factor(y)
)CMConfusion Matrix and Statistics
Reference
Prediction FALSE TRUE
FALSE 1824 263
TRUE 290 2379
Accuracy : 0.8837
95% CI : (0.8743, 0.8927)
No Information Rate : 0.5555
P-Value [Acc > NIR] : <2e-16
Kappa : 0.7642
Mcnemar's Test P-Value : 0.2689
Sensitivity : 0.9005
Specificity : 0.8628
Pos Pred Value : 0.8913
Neg Pred Value : 0.8740
Prevalence : 0.5555
Detection Rate : 0.5002
Detection Prevalence : 0.5612
Balanced Accuracy : 0.8816
'Positive' Class : TRUE
CM2 <- confusionMatrix(data=gwr2.fixed$most,
positive = "TRUE",
reference = gwr2.fixed$y)
CM2Confusion Matrix and Statistics
Reference
Prediction FALSE TRUE
FALSE 1833 268
TRUE 281 2374
Accuracy : 0.8846
95% CI : (0.8751, 0.8935)
No Information Rate : 0.5555
P-Value [Acc > NIR] : <2e-16
Kappa : 0.7661
Mcnemar's Test P-Value : 0.6085
Sensitivity : 0.8986
Specificity : 0.8671
Pos Pred Value : 0.8942
Neg Pred Value : 0.8724
Prevalence : 0.5555
Detection Rate : 0.4992
Detection Prevalence : 0.5582
Balanced Accuracy : 0.8828
'Positive' Class : TRUE
The accuracy has increased slightly from the first model (0.8837) to the second model (0.8846). The second model is slightly worse at predicting functional waterpoints because the sensitivity (true positive rate) has decreased. However, the second model is slightly better at predicting non-functional waterpoints with a higher specificity or true negative rate (0.8671 >0.8628).
Another method to increase the true negative rate is to adjust the threshold value. The code chunk below calculates the confusion matrix if we increase the threshold value to 0.6. True negative rate or specificity is the number of true negatives out of actual negatives. Increasing the threshold value imposes a higher level of certainty to classify an observation as positive. This should reduce the number of negatives misclassified as positive and increase specificity.
Correspondingly, the true positive rate and overall accuracy will be adversely affected but we are more interested in explaining negatives so this may be an acceptable trade-off.
gwr2.fixed <- gwr2.fixed %>%
mutate(most2 = as.factor(
ifelse(
yhat >= 0.6, T, F))
)
CM3 <- confusionMatrix(data=gwr2.fixed$most2,
positive = "TRUE",
reference = gwr2.fixed$y)
CM3Confusion Matrix and Statistics
Reference
Prediction FALSE TRUE
FALSE 1976 472
TRUE 138 2170
Accuracy : 0.8717
95% CI : (0.8619, 0.8811)
No Information Rate : 0.5555
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7443
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.8213
Specificity : 0.9347
Pos Pred Value : 0.9402
Neg Pred Value : 0.8072
Prevalence : 0.5555
Detection Rate : 0.4563
Detection Prevalence : 0.4853
Balanced Accuracy : 0.8780
'Positive' Class : TRUE
As we can see, by increasing the threshold value, we can now predict 93% of non-functional waterpoints. There has been some decrease in true positive rate and accuracy as there will be more false negatives.
Now, let’s plot the predicted results of the second model spatially.
gwr2.fixed.sf <- st_as_sf(gwlr2.fixed$SDF) estprob <- tm_shape(osun)+
tm_polygons(alpha=0.1)+
tm_text(text="ADM2_EN")+
tm_shape(gwr2.fixed.sf) +
tm_dots(col="yhat",
border.col = "gray60",
border.lwd = 1,
palette = "YlOrRd") +
tm_view(set.zoom.limits = c(9, 12))
tmap_arrange(actual, estprob,
asp=1, ncol=2,
sync = TRUE)We should also directly compare the prediction result with the actual. The code chunk below adds the prediction results based on the 0.6 threshold level. We also add indicators for false negatives and false positives to see if the misclassifications show spatial depedency.
gwr2.fixed.sf <- gwr2.fixed.sf%>%
mutate(thres0.6 = as.factor(
ifelse(yhat >= 0.6, T, F)),
y = as.factor(y),
FP = ifelse(thres0.6==T & y==F, T, F),
FN = ifelse(thres0.6==F & y==T, T, F)
)pred0.6 <- tm_shape(osun)+
tm_polygons(alpha=0.1)+
tm_text(text="ADM2_EN")+
tm_shape(gwr2.fixed.sf) +
tm_dots(col="thres0.6",
border.col = "gray60",
border.lwd = 1,
palette = c("#FFFFB2", "#BD0026")) +
tm_view(set.zoom.limits = c(9, 12))
FN <- tm_shape(osun)+
tm_polygons(alpha=0.1)+
tm_text(text="ADM2_EN")+
tm_shape(filter(gwr2.fixed.sf, FN==T)) +
tm_dots(col="FN",
border.col = "gray60",
border.lwd = 1,
palette = c("#FFFFFF", "#000000")) +
tm_view(set.zoom.limits = c(9, 12))
FP <- tm_shape(osun)+
tm_polygons(alpha=0.1)+
tm_text(text="ADM2_EN")+
tm_shape(filter(gwr2.fixed.sf, FP==T)) +
tm_dots(col="FP",
border.col = "gray60",
border.lwd = 1,
palette = c("#FFFFFF", "#000000")) +
tm_view(set.zoom.limits = c(9, 12))
tmap_arrange(actual, pred0.6, FP, FN,
asp=1, ncol=2,
sync = TRUE)There appears to be some clustering of false negatives (false non-functional) in Osogbo and Ilesha West which is not present in the false positives map.
5. Conclusion
From the analysis above, we can can conclude that consideration of spatial dependency improves the prediction of functionality of waterpoints. An interesting result is that waterpoints that are further from tertiary roads (poor access in general) were more like to be functional, but those near from to cities/towns or in urban areas (good access to urban population) were more likely to be functional.
This is an interesting finding because it appears to be contradictory. Waterpoints in rural areas that were far from tertiary roads would be more difficult to access for maintenance but were more likely to be functional, which could indicate that excessive demand in rural areas is a bigger cause for concern to waterpoint functionality than maintenance. On the other hand, urban areas which have good excess and likely high demand are also more likely to be functional, possibly because easy access and larger dependent population means better maintenance regimes in these areas.
Further study on the local coefficients within the local model is needed to understand the trends. Another interesting variable to study would be the age of waterpoint or time since last maintenance as an indicator of state of depreciation.